Numerous studies and anecdotes demonstrate the "wisdom of the crowd," the surprising
accuracy of a group's aggregated judgments. Less is known, however, about the generality
of crowd wisdom. For example, are crowds wise even if their members have systematic
judgmental biases, or can influence each other before members render their judgments? If
so, are there situations in which we can expect a crowd to be less accurate than skilled
individuals? We provide a precise but general definition of crowd wisdom: A crowd is wise
if a linear aggregate, for example a mean, of its members' judgments is closer to the target
value than a randomly, but not necessarily uniformly, sampled member of the crowd.
Building on this definition, we develop a theoretical framework for examining, a priori,
when and to what degree a crowd will be wise. We systematically investigate the boundary
conditions for crowd wisdom within this framework and determine conditions under which
the accuracy advantage for crowds is maximized. Our results demonstrate that crowd
wisdom is highly robust: Even if judgments are biased and correlated, one would need to
nearly deterministically select only a highly skilled judge before an individual's judgment
could be expected to be more accurate than a simple averaging of the crowd. Our results
also provide an accuracy rationale behind the need for diversity of judgments among group
members. Contrary to folk explanations of crowd wisdom which hold that judgments
should ideally be independent so that errors cancel out, we find that crowd wisdom is
maximized when judgments systematically differ as much as possible. We reanalyze data
from 2 published studies that confirm our theoretical results.

Galton (1907) provided perhaps the first documentation
of the "Wisdom of the Crowd" (Surowiecki, 2004) or
"swarm intelligence" (Krause, Ruxton, & Krause, 2009) effect when
he analyzed 787 individuals' guesses of the
weight of a slaughtered and dressed ox. The
individuals were entered in a contest at a livestock
show for which they were charged a small fee,
thus motivating them to guess well. Because they
competed for prizes, their discussions of the [...]
individual judgments uninfluenced by the guesses
of others. Some competitors were highly skilled in
[...] judgment, such as butchers and farmers, while [...]

[...]chology, Fordham University; Jason Dana, School of Management,
Yale University; Stephen B. Broomell, Department of
Social and Decision Sciences, Carnegie Mellon University.

It has since been well established that aggregating
judgments or predictions across individuals
can be surprisingly accurate in a variety of
domains, including prediction markets, political 
polls, game shows, and forecasting (see Sur- 
owiecki, 2004). Under Galton's conditions of 
individuals having largely unbiased and inde- 
pendent judgments, the aggregated judgment of 
a group of individuals is uncontroversially bet- 
ter, on average, than the individual judgments 
themselves (e.g., Armstrong, 2001; Clemen, 
1989; Galton, 1907; Surowiecki, 2004; Win- 
kler, 1971). The boundary conditions of crowd 
wisdom, however, are not as well understood. 
For example, when group members are allowed 
access to other members' predictions, as op- 
posed to making them independently, their pre- 
dictions become more positively correlated, and 
the crowd's performance can diminish (Lorenz, 
Rauhut, Schweitzer, & Helbing, 2011). In the 
context of handicapping sports results, individ- 
uals have been found to make systematically 
biased predictions, so that their aggregated 
judgments may not be wise (Simmons, Nelson, 
Galak, & Frederick, 2011). How robust is 
crowd wisdom to factors such as nonindepen- 
dence and bias of crowd members' judgments? 
If the conditions for crowd wisdom are less than 
ideal, is it better to aggregate judgments or, for 
instance, rely on a skilled individual judge? 
Would it be better to add a highly skilled crowd 
member or a less skilled one who makes sys- 
tematically different predictions than other 
members, increasing diversity? 

We provide a simple, precise definition of the 
wisdom-of-the-crowd effect and a systematic 
way to examine its boundary conditions. We 
define a crowd as wise if a linear aggregate of 
its members' judgments of a criterion value has 
less expected squared error than the judgments 
of an individual sampled randomly, but not nec- 
essarily uniformly, from the crowd. Previous 
definitions of the wisdom of the crowd effect 
have largely focused on comparing the crowd's 
accuracy with that of the average individual 
member (Larrick, Mannes, & Soll, 2012). Our
definition generalizes prior approaches in a cou- 
ple of ways. First, we consider crowds created 
by any linear aggregate, not just simple averag- 
ing. Second, our definition allows the compari- 
son of the crowd to an individual selected ac- 
cording to a distribution that could reflect past 
individual performance; for example, their skill,
or other attributes. On the basis of our definition,
we develop a framework for analyzing
crowd wisdom that includes various aggrega- 
tion and sampling rules. These rules include 
both weighting the aggregate and sampling the 
individual according to skill, where skill is op- 
erationalized as predictive validity; that is, the 
correlation between a judge's prediction and the 
criterion. Although the amount of the crowd's 
wisdom — the expected difference between indi- 
vidual error and crowd error — is nonlinear in 
the amount of bias and nonindependence of the 
judgments, our results yield simple and general 
rules specifying when a simple average will be 
wise. While a simple average of the crowd is 
not always wise if individuals are not sampled 
uniformly at random, we show that there always 
exists some a priori aggregation rule that makes 
the crowd wise. 

Our results suggest that crowd wisdom is 
robust to different choices of aggregation and 
sampling rules. That is, how one aggregates the 
judgments or chooses an individual judge rarely 
affects the qualitative conclusion that even a 
crowd that is a simple average of judges is wiser 
than the individual. By identifying conditions 
for crowd wisdom, our results also provide 
guidance for constructing an optimally wise 
group — a group whose accuracy most exceeds 
that of its individual members — with two sur- 
prising conclusions emerging. First, a crowd 
becomes wisest when it is maximally informa- 
tive, which entails that its members' judgments 
are as negatively correlated with each other as 
possible, as opposed to being independent. 
Thus, the best judge to add to a crowd is one 
that is maximally different from others. One 
intuitive analogy of this result is to think of the 
group as a financial portfolio: sometimes it is
better to diversify performance by "hedging" 
and including an asset that performs well when 
other assets perform poorly. This result pro- 
vides mathematical support for the idea that 
crowds with more diversity are wiser (Hong & 
Page, 2004). Furthermore, our theoretical 
framework provides a mechanism for determin- 
ing when it would be better for the overall group 
prediction to add a group member who, perhaps, 
is less skilled than the alternative members, but 
provides diverse predictions. In other words, 
our framework provides a quantification of the 
accuracy-diversity trade-off.

A second surprising conclusion is that while
the absolute accuracy of the crowd depends on 
the direction and magnitude of members' bias, 
it is almost always preferable to use a weighted 
aggregate of judgments rather than select the 
single best group member, even if the crowd 
members are biased. Unless the best group 
member can be selected deterministically, as in 
certain intellective tasks (Laughlin, 1996), the 
decrease in variance of predictions caused by 
aggregating judgments will offset the bias, a 
manifestation of the well-known bias/variance 
trade-off (Gigone & Hastie, 1997). 

We define accuracy as the average squared 
error of prediction (whether a group or individ- 
ual). This is a common "gold standard" accu- 
racy metric within the field of statistics (Leh- 
mann & Casella, 1998). This accuracy metric 
allows us to derive distribution-free results on 
crowd wisdom. In other words, we make no 
assumptions regarding the underlying distribu- 
tional form of the individual or group's predic- 
tions, such as normality, nor do we impose any 
constraints on the distribution's shape such as 
symmetry or unimodality. Alternative accuracy 
definitions (e.g., average absolute error) can 
change the conclusions of our model, though 
our approach is one that could, in theory, be 
extended to any accuracy metric. 

We present an application of our framework 
to experimental studies by reanalyzing the data 
collected, analyzed and published by Vul and 
Pashler (2008) and Simmons et al. (2011). Our 
analysis finds a "wisdom of the crowd" effect 
when applied to the group of individuals from 
Vul and Pashler (2008), extending the original 
analysis which examined the accuracy of pooled 
repeated judgments within individuals. Our re- 
analysis of the Simmons et al. (2011) data sup- 
ports the overall treatment effect of increasing 
individual bias by manipulating the sports bet- 
ting information available to them. In contrast 
to the original findings reported by Simmons et 
al. (2011), our reanalysis, guided by our new 
formulation, finds an overall improvement of 
the crowd's predictions relative to individuals 
across all treatments in the study. In other 
words, while the members are individually bi- 
ased and the crowd not particularly accurate, the 
crowd is still wise relative to the individual. 

In the next section, we present the general def- 
inition of crowd wisdom and our basic sampling 
assumptions. We then derive a family of inequalities
for evaluating the wisdom of the crowd effect.
We then analyze several special cases, including
comparing an equally weighted linear
aggregate of the judges to probabilistically select- 
ing an individual judge according to his or her 
skill. We then apply our framework to a reanalysis 
of two data sets. We conclude with a discussion 
and present future directions for this work. 

The General Model 
The Crowd Prediction 

Consider a set of $N$-many decision makers
(DMs), where each DM makes a judgment 
about the unknown value of a criterion. We 
model the criterion being predicted (or esti- 
mated) by the group members as a random 
variable with finite mean and variance. In this 
way, we conceptualize our framework as apply- 
ing to random criteria, as in prediction, as well 
as to the special cases of estimating a single 
fixed quantity (which we accommodate by set- 
ting the variance of the criterion to 0). We take 
this criterion value to be a random variable, Y, 
with mean $\mu_y$ and variance $\sigma_y^2$.

Similarly, we assume that each DM's judg- 
ment is a random variable. This assumption 
represents the variability of a DM who gives 
variable responses to the same task. With this 
assumption, we can model how a DM's predic- 
tions correlate with the criterion as well as other 
DMs in the crowd. Let the prediction distribution
of the $i$th DM be the random variable $X_i$,
with mean $\mu_{x_i}$ and variance $\sigma_{x_i}^2$.

A crowd prediction, denoted C, is defined as
the random variable formed by linearly combining
the DMs according to predetermined
weights $w_i$, $C = \sum_{i=1}^{N} w_i X_i$, with the restriction
that all $w_i$ are non-negative and, to ensure
uniqueness, $\sum_{i=1}^{N} w_i = 1$. The weights, $w_i$, are
not random variables, but rather fixed choices of
how to combine crowd member judgments.
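
The definition of the crowd prediction can be sketched in a few lines of code. This is only an illustration: the function name `crowd_prediction` and the three judgment values are hypothetical and are not taken from the original analysis.

```python
from math import isclose


def crowd_prediction(weights, judgments):
    """Return C = sum_i w_i * x_i for fixed, non-negative weights that sum to 1."""
    if len(weights) != len(judgments):
        raise ValueError("one weight per judgment is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if not isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, judgments))


# Hypothetical judgments from three DMs (illustrative numbers only).
judgments = [1100.0, 1250.0, 1300.0]
simple_average = crowd_prediction([1 / 3, 1 / 3, 1 / 3], judgments)
weighted = crowd_prediction([0.5, 0.25, 0.25], judgments)
```

The simple average is the special case in which every weight equals 1/N; any other admissible weight vector gives a different linear aggregate of the same judgments.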

At this point, note that we place no a priori
restrictions on the $\mu_{x_i}$ and $\sigma_{x_i}^2$ values. This allows
for the possibility that DMs are biased,
meaning that their average judgment would not
equal the average criterion value, $E[X_i] = \mu_{x_i} \neq \mu_y$.
Also note that we allow DMs to have
different prediction variances, $\sigma_{x_i}^2$, and arbitrary
covariances with other DMs, where $\sigma_{x_i,x_j}$ denotes
the covariance of $X_i$ and $X_j$. In other
words, the judgments of the crowd members
may be correlated with each other. In this way, 
we can model the effects of crowd members 
influencing each others' judgments. This ap- 
proach builds upon the seminal works of Hog- 
arth (1978) and Winkler (1981), and our anal- 
yses generalize those of Einhorn, Hogarth, and 
Klempner (1977). 

Finally, note that we place no a priori restric- 
tions on the possible ranges of the covariance 
between $X_i$ and $Y$, denoted $\sigma_{x_i,y}$, other than the
usual positive semidefinite restrictions on cova- 
riance matrices. In other words, some crowd 
members may have more skill than others in that 
their judgments are better related to the crite- 
rion. 

To fix these ideas, our framework could be 
used to evaluate the following types of tasks: 

1. A group of $N$-many financial analysts that
predict the weekly changes (or absolute 
changes) in the value of a market index (such as 
the DOWJI), or the exchange rate of the U.S. 
dollar and the Euro. 

2. A group of $N$-many sports prognosticators
who predict the number of points scored every 
week in all NFL games, or the number of goals 
scored in the Bundesliga. 

3. A group of $N$-many weather forecasters who
predict the total amount of monthly rain, or the 
average monthly temperature in a given location. 

4. A group of $N$-many economists predicting
the probability that the unemployment rate next 
month will be below 8%. 

In all of these cases, we have a random target 
variable (criterion) and repeated random predic- 
tions from multiple judges. The individual fore- 
casts and observed realizations of the correspond- 
ing random variables allow straightforward 
estimation of all the parameters (means, variances 
and covariances) that play a role in our model. 
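
As a rough sketch of how these parameters might be estimated from repeated forecasts, the snippet below computes sample means, variances, and covariances. The function names, the data layout (`forecasts[i][t]` for DM i on occasion t), and the numbers are assumptions made for illustration; the sample covariance with an n - 1 denominator is one common estimator, not necessarily the one used in the reanalyses.

```python
from statistics import fmean


def covariance(a, b):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def estimate_parameters(forecasts, outcomes):
    """Estimate the means, variances, and covariances of the judges and the criterion."""
    return {
        "mu_x": [fmean(f) for f in forecasts],
        "mu_y": fmean(outcomes),
        "cov_xx": [[covariance(fi, fj) for fj in forecasts] for fi in forecasts],
        "cov_xy": [covariance(f, outcomes) for f in forecasts],
        "var_y": covariance(outcomes, outcomes),
    }


# Hypothetical data: four occasions, two DMs.
outcomes = [2.0, 4.0, 6.0, 8.0]
forecasts = [
    [1.0, 3.0, 5.0, 7.0],  # DM 1: always one unit too low
    [2.0, 5.0, 5.0, 8.0],  # DM 2: unbiased on average, but noisier
]
params = estimate_parameters(forecasts, outcomes)
```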

We clarify that our definition and analytic 
results are defined over a single, abstract pre- 
diction task. Often, one is interested in the wis- 
dom of the same crowd across multiple, distinct 
prediction tasks. In Section 5, we demonstrate 
how our theory can be extended to such cases 
by adding some additional assumptions on 
crowd behavior for our reanalysis of the Vul 
and Pashler (2008) data set. This allows for an 
application of the theory to a wide range of 
empirical data sets. 

Our model and results are limited to so-called
"statisticized" groups, where the crowd is
merely a mechanical linear aggregation of individual
judgments, as opposed to, for example,
freely interacting deliberative groups like juries 
or structured group interactions (e.g., Delphi 
method; Linstone & Turoff, 1975). While this 
focus is perhaps somewhat limited, it is consistent
with much of the literature on crowd wisdom
(although see Merkle & Steyvers, 2011, for
a Bayesian aggregation model using nonlinear 
weights). Our definition of a "crowd" prediction 
as a linear combination of group member pre- 
dictions contains the simple group average as a 
special case. Our approach can be seen as a 
generalization of several previous approaches, 
such as comparing the group average with an 
individual selected uniformly at random (Einhorn
et al., 1977; Wallsten & Diederich, 2001). We 
extend these approaches by considering other 
special cases, such as the one where the proba- 
bility of selecting an individual is proportional 
to that individual's expected performance (mea- 
sured by the correlation with the criterion vari- 
able). 

Prediction of an Individual Selected 
Randomly 

We consider whether the crowd's judgment 
is expected to be better than an individual crowd 
member's. Let P be the random variable formed
by selecting a single member of the crowd
probabilistically, and let $p_i$ denote the probability
of selecting the $i$th crowd member, with
$p_i \geq 0$, $\forall i \in \{1, 2, \ldots, N\}$, and $\sum_{i=1}^{N} p_i = 1$. As
a special case, if all $p_i$ values are equal, that is,
$p_i = \frac{1}{N}$, $\forall i \in \{1, 2, \ldots, N\}$, then P reduces to
selecting any individual DM with equal probability.
At the other extreme, if $p_k = 1$ with $p_i = 0$,
$\forall i \in \{1, 2, \ldots, N\}$, $i \neq k$, then the $k$th DM is
selected with probability one, for example, the
highest performing group member is known. In a
later example we consider the case where $p_i$ is
proportional to the DM's correlation with Y.
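
A minimal sketch of the selection step follows, assuming the selection probabilities are already given. The helper name `select_individual` and the judgment values are hypothetical; `random.choices` from the Python standard library performs the weighted draw.

```python
import random


def select_individual(judgments, probs, rng):
    """Draw one DM's judgment; DM i is chosen with probability probs[i]."""
    return rng.choices(judgments, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded only so the sketch is reproducible
judgments = [10.0, 12.0, 15.0]

# Uniform selection: every DM is equally likely to be chosen.
uniform_draw = select_individual(judgments, [1 / 3, 1 / 3, 1 / 3], rng)

# Degenerate selection: all probability on the second DM, so that DM is always chosen.
best_known = select_individual(judgments, [0.0, 1.0, 0.0], rng)
```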

A Wisdom of the Crowd Criterion 

We consider the expected squared loss be- 
tween each prediction distribution and the cri- 
terion distribution Y throughout. We compare 
the values $E[(C - Y)^2]$ and $E[(P - Y)^2]$ to one
another, where "$E[\cdot]$" is the expectation operator.
In other words, the prediction model that
comes closest, on average, to Y is considered to
be more accurate. This accuracy criterion, ex- 
pected squared-error, is only appropriate for 
tasks in which "closeness" of a prediction or
judgment can be evaluated on a continuous 
scale (see Lee, Steyvers, de Young, and Miller, 
2012; Yi, Steyvers, Lee, & Dry, 2012, for recent 
approaches to modeling crowd wisdom for 
combinatorial and ranking tasks, which would 
not meet our modeling assumptions). 

We define a wisdom of the crowd effect to
hold if, and only if,

$$E[(C - Y)^2] \leq E[(P - Y)^2] \tag{1}$$

for some crowd aggregate weights, $w_i$, $i \in \{1, 2, \ldots, N\}$,
and selection distribution probability
weights $p_i$, $i \in \{1, 2, \ldots, N\}$. Note that
the right-hand side of Inequality (1) is the ex- 
pected accuracy of selecting an individual ac- 
cording to an arbitrary, prespecified probability 
distribution, in contrast to previous formula- 
tions such as evaluating the arithmetic mean 
accuracy of individual predictions (Larrick et 
al., 2012). 

Let $\boldsymbol{\mu}_x$ be the $N \times 1$ vector of the DMs' mean
predictions. Let $\Sigma_{xx}$ be the covariance matrix of
the $X_i$, $i \in \{1, 2, \ldots, N\}$, random variables. Let
$\boldsymbol{\sigma}_{xy}$ denote the $N \times 1$ vector of covariances of Y
with each $X_i$, $i \in \{1, 2, \ldots, N\}$. It is straightforward
to show that $E[(C - Y)^2]$ is equal to the
following:

$$E[(C - Y)^2] = (\boldsymbol{\mu}_x' w - \mu_y)^2 + w'\Sigma_{xx}w - 2w'\boldsymbol{\sigma}_{xy} + \sigma_y^2,$$

where $w$ is the $N \times 1$ vector of weights,
$w_i$, $i \in \{1, 2, \ldots, N\}$, defining C.

Next, we consider the random variable $(P - Y)^2$.
An application of the iterated expectation
theorem (e.g., Bickel & Doksum, 2001) yields:

$$E[(P - Y)^2] = \sum_{i=1}^{N} p_i\left[(\mu_{x_i} - \mu_y)^2 + \sigma_{x_i}^2 - 2\sigma_{x_i,y} + \sigma_y^2\right].$$

Proposition 1. (Wisdom of the Crowd
Effect). The aggregate crowd prediction distribution,
C, defined by $w$ has lower expected
loss than an individual judgment selected according
to the probability measure, $p_i$, $i \in \{1, 2, \ldots, N\}$,
if, and only if, the following inequality holds:

$$(\boldsymbol{\mu}_x' w - \mu_y)^2 + w'\Sigma_{xx}w - 2w'\boldsymbol{\sigma}_{xy} + \sigma_y^2 \leq \sum_{i=1}^{N} p_i\left[(\mu_{x_i} - \mu_y)^2 + \sigma_{x_i}^2 - 2\sigma_{x_i,y} + \sigma_y^2\right]. \tag{2}$$
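
The condition in Proposition 1 can be checked numerically once the means, variances, and covariances are specified. The sketch below is illustrative only: the function names and all parameter values are hypothetical, and the two loss functions use the standard moment expansions of $E[(C - Y)^2]$ and $E[(P - Y)^2]$ (squared bias plus variance, minus twice the covariance with Y, plus the variance of Y).

```python
def crowd_loss(w, mu_x, mu_y, cov_xx, cov_xy, var_y):
    """E[(C - Y)^2] for the linear aggregate C = sum_i w_i X_i."""
    n = len(w)
    bias = sum(w[i] * mu_x[i] for i in range(n)) - mu_y
    var_c = sum(w[i] * w[j] * cov_xx[i][j] for i in range(n) for j in range(n))
    cov_cy = sum(w[i] * cov_xy[i] for i in range(n))
    return bias ** 2 + var_c - 2 * cov_cy + var_y


def individual_loss(p, mu_x, mu_y, cov_xx, cov_xy, var_y):
    """E[(P - Y)^2] = sum_i p_i * E[(X_i - Y)^2]."""
    return sum(
        p[i] * ((mu_x[i] - mu_y) ** 2 + cov_xx[i][i] - 2 * cov_xy[i] + var_y)
        for i in range(len(p))
    )


# Hypothetical standardized crowd of three DMs, two of them biased.
mu_x, mu_y, var_y = [0.5, -0.5, 0.0], 0.0, 1.0
cov_xx = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
cov_xy = [0.6, 0.4, 0.2]
uniform = [1 / 3, 1 / 3, 1 / 3]

loss_crowd = crowd_loss(uniform, mu_x, mu_y, cov_xx, cov_xy, var_y)
loss_individual = individual_loss(uniform, mu_x, mu_y, cov_xx, cov_xy, var_y)
loss_expert = individual_loss([1.0, 0.0, 0.0], mu_x, mu_y, cov_xx, cov_xy, var_y)
crowd_is_wise = loss_crowd <= loss_individual
```

With these made-up values the simple average has a lower expected loss than both a uniformly selected DM and the DM with the highest covariance with the criterion; other parameter choices can reverse the second comparison.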

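It is possible to rearrange Inequality (2) in a
way that separates clearly the various factors
that drive the effect. By rearranging terms, we
can simplify this expression as follows.

$$\sum_{\substack{i,j=1 \\ i \neq j}}^{N} w_i w_j(\sigma_{x_i,x_j} + \mu_{x_i}\mu_{x_j}) \leq 2\sum_{i=1}^{N}(w_i - p_i)(\mu_y\mu_{x_i} + \sigma_{x_i,y}) - \sum_{i=1}^{N}(w_i^2 - p_i)(\sigma_{x_i}^2 + \mu_{x_i}^2) \tag{3}$$

If we additionally assume that $\mu_y = 0$, so that
$MSE_{x_i} = \sigma_{x_i}^2 + \mu_{x_i}^2$, we obtain:

$$\sum_{\substack{i,j=1 \\ i \neq j}}^{N} w_i w_j(\sigma_{x_i,x_j} + \mu_{x_i}\mu_{x_j}) \leq 2\sum_{i=1}^{N}(w_i - p_i)\sigma_{x_i,y} - \sum_{i=1}^{N}(w_i^2 - p_i)MSE_{x_i}.$$
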

The right-hand side of this inequality focuses
on the N individuals in the crowd. In particular,
this expression highlights the effect of the difference
between the weights assigned to individuals
in the crowd ($w_i$) and the probabilities of
selecting these individuals from the crowd ($p_i$)
on the individual mean squared errors of the
individual judges and the individual judges'
covariances with the criterion. This expression is
maximized when individual judges with high
covariances with the criterion ($\sigma_{x_i,y}$) and low
individual mean squared errors are overweighted
in the crowd, relative to their probability
of selection. On the other hand, the left-hand
side of the inequality is independent of
the criterion, Y, and reflects only the interrelation
between the various judges in the crowd
and their relative weights. It is minimized when
there is an inverse relationship between
$E(X_i X_j) = \sigma_{x_i,x_j} + \mu_{x_i}\mu_{x_j}$ and the weights $w_i w_j$, that is, when
pairs of judges with high (low) $E(X_i X_j)$ are
assigned relatively low (high) weights.

The above proposition provides an explicit,
testable condition to determine whether a given 
crowd is wise. In the following special cases, 
we demonstrate how this result can be used to 
evaluate the relative trade-offs of group member 
interdependence versus bias. First, we prove a 
basic result within our framework. 

Result 1. Consider the case when $w_i = p_i$,
$\forall i \in \{1, 2, \ldots, N\}$, that is, the aggregation
weights providing the crowd prediction are 
identical to the selection weights used to deter- 
mine the individual DM prediction distribution. 
Then a wisdom of the crowd effect always 
holds. 

Proof. See Appendix. 

Result 1 extends the finding that the crowd 
member average is more accurate than the av- 
erage crowd member to any situation in which 
the aggregation weights are identical to the 
probability weights used to select the individ- 
ual. As in previous, related arguments (Dawes, 
1970; Hogarth, 1978), Result 1 is a straightfor- 
ward application of Jensen's inequality. 
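
The convexity argument behind Result 1 can be illustrated numerically. The sketch below is not the proof from the Appendix: it only checks, on randomly generated cases, that the squared error of a weighted average never exceeds the identically weighted average of the individual squared errors. All names and the simulated values are hypothetical.

```python
import random


def weighted_crowd_error(p, xs, y):
    """Squared error of the aggregate formed with weights p."""
    return (sum(pi * xi for pi, xi in zip(p, xs)) - y) ** 2


def expected_individual_error(p, xs, y):
    """Expected squared error when judgment x_i is selected with probability p_i."""
    return sum(pi * (xi - y) ** 2 for pi, xi in zip(p, xs))


rng = random.Random(42)
violations = 0
for _ in range(10_000):
    n = rng.randint(2, 8)
    raw = [rng.random() + 1e-9 for _ in range(n)]
    p = [r / sum(raw) for r in raw]           # aggregation weights = selection probabilities
    xs = [rng.gauss(0, 5) for _ in range(n)]  # arbitrary, possibly biased judgments
    y = rng.gauss(0, 5)                       # realized criterion value
    # Small tolerance guards against floating-point rounding only.
    if weighted_crowd_error(p, xs, y) > expected_individual_error(p, xs, y) + 1e-9:
        violations += 1
```

Because the comparison holds for every realization, it also holds in expectation, whatever the biases and correlations of the judges.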

While Result 1 guarantees the existence of an 
aggregation method that makes the crowd wise, 
it is interesting to consider particular aggrega- 
tion rules, such as an unweighted aggregate, 
against other methods of selecting individuals. 
For example, a practical problem that one might 
face is to choose between the judgment of an 
available expert and a crowd. Without exact 
knowledge of the relevant crowd parameters, it 
is difficult to decide on aggregation rules other 
than a simple average. We can still run through 
scenarios involving a simple crowd average, 
however, to see how likely it would be to find 
an expert that is more accurate. Furthermore, we 
may wonder about the extent of crowd wisdom 
and factors that lead to maximizing the crowd's 
predictive advantage over the individual. We 
next provide some comparative analyses using 
Inequality (1) to demonstrate factors that max- 
imize crowd wisdom. In the section following, 
we examine special cases in which the aggre- 
gation weights do not match the selection 
weights. 

Unweighted Average Versus Selecting an 
Individual at Random With Equal 
Probability 

Consider the simple case where the crowd, C,
is defined by the unweighted (simple) group
average, $w_i = \frac{1}{N}$, $\forall i \in \{1, 2, \ldots, N\}$, and the
competitor model P is defined by the uniform
distribution $p_i = \frac{1}{N}$, $i \in \{1, 2, \ldots, N\}$; that is,
the competitor individual is selected uniformly
at random. This models the case where one has
no prior information to suggest or reason to
believe that any member of the group is any
better at the prediction task than any other.
Inequality (2) can be rearranged and written:

$$\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_{x_i,x_j} - \frac{1}{N}\sum_{i=1}^{N}\sigma_{x_i}^2 \leq \frac{1}{N}\sum_{i=1}^{N}(\mu_{x_i} - \mu_y)^2 - \left(\frac{1}{N}\sum_{i=1}^{N}\mu_{x_i} - \mu_y\right)^2 \tag{4}$$


Recall that this inequality is simply an algebraic
rearrangement of the inequality, $E[(C - Y)^2] \leq E[(P - Y)^2]$,
and thus, the magnitude to
which (4) deviates from equality is precisely the
expected difference between the random variables
$(C - Y)^2$ and $(P - Y)^2$. The greater the
deviation from equality in (4), the more pro- 
nounced the wisdom of the crowd effect. 
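
For the equal-weight, uniform-selection case, the two sides of the rearranged inequality can be computed directly. The sketch below assumes the left-hand side is the variance of the simple average minus the average individual variance, and the right-hand side is the average squared bias minus the squared bias of the average; the function name and the example values are hypothetical.

```python
def inequality_sides(mu_x, mu_y, cov_xx):
    """Return (lhs, rhs): the covariance term and the bias term for a simple average."""
    n = len(mu_x)
    lhs = (
        sum(cov_xx[i][j] for i in range(n) for j in range(n)) / n ** 2
        - sum(cov_xx[i][i] for i in range(n)) / n
    )
    mean_mu = sum(mu_x) / n
    rhs = sum((m - mu_y) ** 2 for m in mu_x) / n - (mean_mu - mu_y) ** 2
    return lhs, rhs


n = 4
all_ones = [[1.0] * n for _ in range(n)]  # perfectly correlated, standardized judges
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # uncorrelated

lhs_redundant, rhs_unbiased = inequality_sides([0.0] * n, 0.0, all_ones)
lhs_independent, rhs_bracketing = inequality_sides([-1.0, 1.0, -1.0, 1.0], 0.0, identity)

# The crowd's expected accuracy advantage is rhs - lhs.
advantage = rhs_bracketing - lhs_independent
```

Perfectly redundant, unbiased judges give both sides equal to zero, so aggregation gains nothing; uncorrelated judges whose biases offset each other yield a strictly positive advantage.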

What does the composition of the "crowd" look
like when the difference $E[(P - Y)^2] - E[(C - Y)^2] \geq 0$
is maximized? When it is minimized?
First, consider the left-hand side of
Inequality (4). Because the covariance matrix of
the judges is positive semidefinite, this side of
the inequality is necessarily nonpositive. To
simplify matters, assume that all predictions are
standardized such that the covariances between
judges $X_i$ and $X_j$ can be interpreted as correlations,
$r_{x_i,x_j}$. The left-hand side of Inequality (4) is
maximized when all the judges are perfectly
correlated with each other, that is,

$$\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} r_{x_i,x_j} - 1 = 0.$$

As the judges become less correlated
with one another, this value becomes smaller
and the wisdom of the crowd effect becomes 
more pronounced. This result is intuitive be- 
cause if all the judges provide the same (or 
almost the same) predictions, there would be 
little gained by aggregation. Note that when 
crowd members' judgments are highly correlated,
adding in a new member whose judgments
are nonredundant improves crowd wisdom,
particularly for a small crowd. As is
evident from Inequality (4), this improvement 
can occur even if the new, nonredundant mem- 
ber has substantially lower skill than the exist- 
ing members. The intuition behind this result is 
clearer when one considers other linear aggre- 
gation problems: If one has highly redundant 
predictor variables in a multiple regression, 
adding a predictor that gives new information 
will help the model even if the new predictor is 
poorly correlated with the outcome variable. In 
the context of multiple regression this effect is 
similar to "suppression" in that a new member 
can improve the crowd's estimate via his or her 
relationship with other judges and not the crite- 
rion per se (see Tzelgov & Henik, 1991). 

It is worth noting that while the left-hand side 
is smaller for perfectly uncorrelated judges, it is 
not minimized in this case. This term is mini- 
mized when all judges are equally and maxi- 
mally negatively correlated with one another. 
This result is distinct from the folk explanation 
of crowd wisdom which holds that indepen- 
dence among judges is necessary so that errors 
cancel out (variance reduction does factor into 
the right-hand side of this inequality, as shown 
below). This result is in line, however, with 
other mathematical models demonstrating 
how group diversity can improve overall 
group accuracy (Hong & Page, 2004). To 
clarify, this maximal negative correlation is 
subject to the usual positive semidefinite con- 
straints on the interjudge correlation matrix, 
which, for large crowds, will be necessarily 
very small. As the number of judges goes to 
infinity, the maximal negative correlation of 
all judges approaches zero. 
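
The limit on negative correlation can be made concrete for the equicorrelated case. An N x N correlation matrix with a common off-diagonal value rho has eigenvalues 1 - rho and 1 + (N - 1) rho, so it is positive semidefinite only if rho is at least -1/(N - 1). The sketch below, with hypothetical function names, tabulates that bound and the variance of the simple average of standardized judges at the bound.

```python
def min_equicorrelation(n):
    """Smallest common correlation for which the n x n equicorrelation matrix is PSD."""
    if n < 2:
        raise ValueError("need at least two judges")
    return -1.0 / (n - 1)


def average_variance(n, rho):
    """Variance of the simple average of n standardized judges with common correlation rho."""
    return (1.0 + (n - 1) * rho) / n


for n in (2, 5, 25, 1000):
    rho = min_equicorrelation(n)
    print(n, round(rho, 4), average_variance(n, rho))
```

At the bound the variance of the average vanishes (up to floating-point rounding), whereas independent judges leave a variance of 1/N; the bound itself shrinks toward zero as N grows.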

While the left-hand side of Inequality (4) 
describes the effects of intercorrelation between 
judges, the right-hand side describes the effects 
of judge bias, that is, the expected squared de- 
viation between a judge's prediction and the 
criterion. This side of the inequality must 
necessarily be non-negative. This term,

$$\frac{1}{N}\sum_{i=1}^{N}(\mu_{x_i} - \mu_y)^2 - \left(\frac{1}{N}\sum_{i=1}^{N}\mu_{x_i} - \mu_y\right)^2,$$

is minimized when the true means of all judge predictions
are equal to the mean of the criterion, that
is, when all judges are unbiased. In this case, there
is little benefit to aggregating the judges from the
standpoint of minimizing bias because all of them
are unbiased. All aggregation can do in this
situation is reduce the variance of the aggregate 
prediction (Wallsten & Diederich, 2001). Maxi- 
mizing this term is far more interesting. The right- 
hand side of (4) becomes arbitrarily large as the 
average squared bias of the judges becomes large 
with the squared bias of the judge average remain- 
ing small or zero. In this case, all judges are 
(possibly greatly) biased in their individual pre- 
dictions, yet the average of their predictions is 
very close to the true criterion. Put in the terminology
of Larrick and Soll (2006), the individual
judge predictions "bracket" the true criterion
mean, falling, more or less, both above and below
$\mu_y$. Here, the individual predictions systematically
fall either above or below $\mu_y$, but when averaged,
this individual bias is cancelled and the crowd 
prediction is wise. 
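
A two-judge worked example makes the bracketing case concrete. It is a sketch under assumed values: the symbol $b$ (a common bias magnitude) and the label $B$ for the bias term are introduced here for illustration only. The bias term is taken to be the average squared bias minus the squared bias of the average.

```latex
% Bias term for N judges:
%   B = (1/N) \sum_i (\mu_{x_i} - \mu_y)^2 - ( (1/N) \sum_i \mu_{x_i} - \mu_y )^2
\begin{align*}
\text{Bracketing: } & \mu_{x_1} = \mu_y + b,\ \mu_{x_2} = \mu_y - b
  &&\Rightarrow\quad B = \tfrac{1}{2}(b^2 + b^2) - 0^2 = b^2, \\
\text{Same side: } & \mu_{x_1} = \mu_{x_2} = \mu_y + b
  &&\Rightarrow\quad B = \tfrac{1}{2}(b^2 + b^2) - b^2 = 0.
\end{align*}
```

When the two biases have opposite signs, averaging removes all of the squared bias $b^2$; when they share a sign, averaging removes none of it, and any advantage of the crowd must come from the covariance term instead.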

Unweighted Average Versus Selecting an 
Individual According to Their Skill 

In the previous section, we considered the 
simple case where one does not discriminate 
between the individual judges a priori. Prior 
work has demonstrated that for intellective 
tasks without demonstrable solutions, groups 
are often poor at identifying the highest per- 
forming member (Henry, 1995). While infor- 
mative, this case may not be general. Often, we 
do have information on the prior performance of 
judges; that is, their skill at a particular predic- 
tion task, for example Cooke's method (Cooke, 
1991). In this section, we compare different 
aggregation weights, $w_i$, to a randomly selected
individual such that the probability of selecting
the $i$th DM is proportional to that judge's skill,
defined as the correlation of his or her prediction
with the criterion, Y. Let all judges' skill be
non-negative, $r_{x_i,y} \geq 0$, $\forall i \in \{1, 2, \ldots, N\}$, and
let $p_i$ be defined as follows:

$$p_i = \frac{r_{x_i,y}}{\sum_{j=1}^{N} r_{x_j,y}}. \tag{5}$$




Clearly, the higher the correlation of an indi- 
vidual's predictions with the criterion, the more 
likely that individual will be selected. If all DM 
predictions are equally correlated with Y then 
this choice of P will reduce to selecting a DM 
uniformly at random, which has already been 
shown in Result 1 to be inferior, on average, to
the case where C is the unweighted average. At 
the other extreme, if only a single DM's predic- 
tions correlate with Y then that DM will be 
chosen with probability one, and will, most 
likely, outperform the unweighted average of 
the crowd. 
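
Skill-proportional selection can be sketched as follows. The normalization by the sum of the correlations is an assumption (it is the only scaling that makes probabilities proportional to skill sum to one), and the function name and skill values are hypothetical.

```python
def skill_proportional_probs(skills):
    """Selection probabilities proportional to each judge's correlation with the criterion."""
    if any(r < 0 for r in skills):
        raise ValueError("skills (correlations with the criterion) must be non-negative")
    total = sum(skills)
    if total == 0:
        raise ValueError("at least one judge must have positive skill")
    return [r / total for r in skills]


equal_skill = skill_proportional_probs([0.5, 0.5, 0.5, 0.5])  # reduces to uniform selection
one_expert = skill_proportional_probs([0.0, 0.8, 0.0])        # the single skilled DM is always chosen
mixed = skill_proportional_probs([0.9, 0.3, 0.3])             # the best judge is favored, not guaranteed
```

The two limiting cases described above fall out directly: equal skills give uniform probabilities, and a single judge with nonzero skill is selected with probability one.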

Let C be defined according to the simple 
unweighted average and let P be defined ac- 
cording to (5). Applying Proposition 1 and re- 
arranging terms gives us the following result. 

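Corollary 1. Let $w_i = \frac{1}{N}$, $\forall i \in \{1, 2, \ldots, N\}$,
and let $p_i$ be defined as in Equation (5). Assume
that all variables are standardized such
that $\sigma_{x_i,y} = r_{x_i,y}$ and $\sigma_{x_i,x_j} = r_{x_i,x_j}$, $\forall i, j$. Then a
wisdom of the crowd effect holds if, and only if,

$$MSE_{crowd} - \sum_{i=1}^{N}\frac{r_{x_i,y}}{\sum_{j=1}^{N} r_{x_j,y}}MSE_{x_i} \leq 2\left(\mathrm{Mean}(r_{x,y}) - \frac{\sum_{i=1}^{N} r_{x_i,y}^2}{\sum_{i=1}^{N} r_{x_i,y}}\right), \tag{6}$$

where $MSE_{x_i}$ is the mean squared error for the
$i$th DM prediction distribution and $MSE_{crowd}$ is
the mean squared error for the crowd prediction,
C.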

Mean squared error is equivalent to the sum of 
the prediction distribution's squared bias (with 
respect to $\mu_y$) and its variance. Examining the
right-hand side of Inequality (6), we see that the 
crowd prediction benefits when skill is evenly 
distributed among the DMs; in other words, when 
all DMs are "equally good." The left-hand side of 
Inequality (6) indicates that for the crowd to do 
well, the most highly skilled DMs (those with the 
highest correlations with the criterion) should also 
have the largest biases. 
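
The role of skill dispersion can be illustrated with a short computation. The sketch assumes the skill term takes the form 2(mean(r) - sum(r^2)/sum(r)), which is what results from setting equal aggregation weights and skill-proportional selection probabilities in Proposition 1 with standardized variables; the function name and the skill values are hypothetical.

```python
def skill_term(skills):
    """2 * (mean skill - skill-weighted mean skill); never positive, zero when skills are equal."""
    n = len(skills)
    return 2 * (sum(skills) / n - sum(r * r for r in skills) / sum(skills))


even = skill_term([0.5, 0.5, 0.5])      # evenly distributed skill
skewed = skill_term([0.9, 0.1, 0.1])    # one strong judge, two weak ones
```

By the Cauchy-Schwarz inequality the term cannot exceed zero, and it reaches zero only when every judge is equally skilled, which is the situation most favorable to the unweighted crowd.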

One DM Doesn't Follow the "Herd": 
Unbiased Case 

Consider the case of a defector, "one DM 
who doesn't follow the herd" model assuming 
that C is defined by the unweighted group av- 
erage and P is defined as in (5). In this case, we 
will assume that there are $N$-many judges with
$N - 1$ DMs who positively correlate with the
criterion, Y, and each other at the value $\psi$. This
group of $N - 1$ DMs represents the "herd." The
remaining DM is the "dissident" and correlates
with the criterion Y at $\phi$. To ensure the positive
semidefiniteness of the intercorrelation matrix
between judges and the criterion, we will also
assume that the dissident correlates with the
herd judges at [...]. For now we assume that all
$N$-many DMs are unbiased in their predictions,
that is, $\mu_{x_i} = \mu_y$, $\forall i \in \{1, 2, \ldots, N\}$. We will
consider biased cases in subsequent sections. 

Clearly, when $\psi = 0$ and $\phi$ is large, we
have a group of uncorrelated DMs and only
the dissident DM has any skill at predicting
the criterion variable.$^1$ Under this set of assumptions,
the dissident is selected with probability
1 under P, and will be more accurate,
in expectation, than the group aggregate C. At
the other extreme, if $\psi = \phi$ we have a group
of equally skilled DMs who are equally correlated
with one another. Under this condition,
P, as defined in (5), reduces to selecting
one of the DMs with equal probability and
will always do worse, on average, than the
unweighted aggregate, C, by Result 1. To
investigate this relationship, we consider
three different levels of "skill" on the part of
the dissident, $\phi = .95$ (high), $\phi = .70$ (medium)
and $\phi = .40$ (low), and vary the ratio $\psi/\phi$
from 0 to 1. As an application of Corollary 1,
let LHS be the left-hand side of Inequality
(6); similarly, let RHS be the right-hand side
of Inequality (6). Thus, the value R = RHS/LHS
is the ratio of individual expected loss to
crowd expected loss. For ease of presentation,
our results are in terms of R on a logarithmic
scale, denoted log(R). Log(R) provides a continuous
measure of the wisdom of the crowd
effect with positive (negative) numbers indicating
the crowd is expected to be more (less)
accurate than an individual chosen at random.
When log(R) = 0, the expected loss of the
crowd is equal to the individual expected loss.
Figure 1 plots log(R) as a function of the ratio



' As N increases without bound, ^Sf equals the minimal corre- 
lation between the dissident and herd DMs that guarantees that the 
resulting correlation matrix is positive semidefinite. 
Figure 1. This figure plots log(R) as a function of the ratio ψ/φ for φ = .95, .70, .40. For each value of φ, log(R) values are plotted for five sample sizes, N = 5, 10, 15, 20, 25. The left-hand graph corresponds to φ = .95, the middle graph corresponds to φ = .70, and the right-hand graph corresponds to φ = .40. The line, log(R) = 0, is plotted for reference. All points above this line represent cases where the crowd's performance is superior.



ψ/φ for group sizes N = 5, 10, 15, 20, 25, separately for the different skill levels of the dissident. The line denoting equal accuracy of the crowd and a randomly selected individual is plotted for reference.
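The quantity plotted in Figure 1 can be sketched numerically. The following is a minimal illustration, not the authors' code, and it rests on two loudly stated assumptions: (a) the selection probabilities of Equation (5), which is not reproduced in this section, are taken to be proportional to each DM's correlation with the criterion, which is at least consistent with the two limiting cases described above; and (b) the dissident-herd correlation is passed in as a hypothetical parameter `rho_dh`, because its value is not legible in the source. All variables are standardized, so variances equal 1 and covariances equal correlations.

```python
import numpy as np

def log_r_herd(n, psi, phi, rho_dh, biases=None):
    """log(R) for one dissident plus (n - 1) herd DMs, standardized variables.

    psi    : herd correlation with the criterion and with each other
    phi    : dissident correlation with the criterion
    rho_dh : dissident-herd correlation (HYPOTHETICAL parameter, see lead-in)
    biases : optional length-n vector of mean biases in SD units, dissident first
    """
    r_xy = np.array([phi] + [psi] * (n - 1))       # validities, dissident first
    r_xx = np.full((n, n), float(psi))             # herd intercorrelations
    r_xx[0, :] = rho_dh
    r_xx[:, 0] = rho_dh
    np.fill_diagonal(r_xx, 1.0)

    # Reject configurations whose joint correlation matrix (Y plus all DMs)
    # is not positive semidefinite.
    full = np.block([[np.ones((1, 1)), r_xy[None, :]], [r_xy[:, None], r_xx]])
    if np.linalg.eigvalsh(full).min() < -1e-9:
        raise ValueError("correlation matrix is not positive semidefinite")

    b = np.zeros(n) if biases is None else np.asarray(biases, dtype=float)
    w = np.full(n, 1.0 / n)                        # unweighted average C
    p = r_xy / r_xy.sum()                          # ASSUMED form of Equation (5)

    # Expected squared loss of the crowd (LHS) and of a sampled individual (RHS).
    lhs = (w @ b) ** 2 + w @ r_xx @ w - 2 * (w @ r_xy) + 1.0
    rhs = p @ (b ** 2 + 1.0 - 2 * r_xy) + 1.0
    return np.log(rhs / lhs)
```

For example, `log_r_herd(5, 0.5, 0.5, 0.5)` describes five equally skilled, equally correlated DMs, for which the crowd loss is 0.6 and the individual loss is 1, so log(R) is positive, in line with Result 1.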

Note that the simple unweighted average, C, performs quite favorably compared with an individual selected at random according to (5), even in the case where the dissident has a high correlation with the criterion. Selecting an individual according to (5) outperforms the crowd in this case only when the herd is very weakly correlated with the criterion; for example, the point at which C outperforms P for N = 15 under φ = .95 occurs when ψ = .052. The size of the groups plays a large role in determining the point at which the wisdom of the crowd effect emerges. As expected, the larger the group size, the more pronounced the wisdom of the crowd effect and the smaller the value of ψ/φ at which it emerges. There is also a strong effect of the skill level of the dissident. As φ decreases, the favorable range of the ratio ψ/φ under P becomes quite small and, eventually, vanishes (φ = .40). In this case, even if one could select deterministically the best member of the group, their modest correlation with the criterion would not offset the reduction in sampling error achieved by incorporating an unweighted average of the rest of the DMs. To summarize, unless one could nearly deterministically select the best member of the group, who must be highly skilled, a simple unweighted group average will, on average, prevail.

It is interesting that the performance of C versus P under these assumptions is highly nonlinear. Under the extreme case of ψ/φ = 1, a wisdom of the crowd effect is guaranteed to occur by Result 1, yet this is not the condition that yields maximal log(R) values. Across all three conditions, the largest values of log(R) occur when the herd is modestly, but not maximally, correlated with the criterion. Under these values of ψ/φ, the dissident has a reasonable chance of not being selected, with the remaining group members having relatively smaller prediction correlations with the criterion. Yet, there is a large amount of information present in the group as a whole, as measured by small judge intercorrelation, so the prediction of C will likely perform extremely well.



One DM Doesn't Follow the "Herd": 
Biased Case 

Let us return to the "one DM doesn't follow the herd" model analyzed above but allow the DMs in the herd to be biased in their predictions. Intuitively, the wisdom of the crowd effect, as defined in Proposition 1, should depend not just on the magnitude of the DM biases but also on their relative configuration. For example, a herd in which the DM biases are equally likely to be above or below μ_y will likely result in different log(R) values than a herd in which all DM biases are in the same direction.

We consider two cases. In the symmetric case, the prediction biases of the herd DMs are no more likely to be above than below μ_y, with the dissident DM as the sole member whose predictions are unbiased. Table 1 displays the μ_x values for this model under the symmetry condition for the five group sizes, N = 5, 10, 15, 20, 25. Recall that because μ_y is defined to be zero, it suffices to specify μ_x to model bias. As before, we assume that all prediction values are standardized so that values of μ_x can be interpreted as bias in units of standard deviations. As shown in Table 1, the numbers of DMs in each group with positive and negative biases are roughly equal and symmetric with respect to bias magnitude. As group size increases, the magnitude of the biases also increases, with a maximal bias of plus or minus two standard deviations.



Table 1
Bias Configurations for the Five Hypothetical Groups Under the Symmetry Condition

Number of decision     Possible bias values (counts per group)
makers (DMs)             0     .5    -.5     1     -1    1.5   -1.5     2     -2
N = 5                    1     1      1      1      1     0      0      0      0
N = 10                   1     3      2      2      2     0      0      0      0
N = 15                   1     3      3      3      3     1      1      0      0
N = 20                   1     4      3      3      3     3      3      0      0
N = 25                   1     3      3      3      3     3      3      3      3

Note. The columns indicate the possible bias values we consider, as well as the number of DMs in each group with that bias level. For example, the group with 5 DMs has one unbiased DM, with the remaining 4 DMs having bias levels of .5, -.5, 1, and -1.
Figure 2 displays log(R) values as a function of the ratio ψ/φ under the symmetry condition. All other assumptions are identical to the analysis in the previous section. The log(R) values are much larger in this condition than in the completely unbiased case previously examined; likewise, the size of the group has a more pronounced effect on how "wise" a group is, with larger groups resulting in larger log(R) values. In other words, given that the dissident is the only unbiased DM in the group, selecting an individual probabilistically incurs a much higher penalty. However, from the perspective of the crowd prediction C, the bias penalty is averaged out, because the biases of the individual members are symmetric about μ_y, similar to the bracketing effect of Larrick and Soll (2006). In this scenario, aggregation can only help to lower prediction variance, hence the more extreme wisdom of the crowd effect.



Figure 2. This figure plots log(R) as a function of the ratio ψ/φ for φ = .95, .70, .40. For each value of φ, log(R) values are plotted for five sample sizes, N = 5, 10, 15, 20, 25. The left-hand graph corresponds to φ = .95, the middle graph corresponds to φ = .70, and the right-hand graph corresponds to φ = .40. The line, log(R) = 0, is plotted for reference. All points above this line represent cases where the crowd's performance is superior.
Next, we examine another version of the model with identical assumptions except that the bias configuration of the DM predictions is asymmetric. For example, all judges might systematically overestimate the probability of a rare event, such as the probability of a high-intensity earthquake. We set the μ_x vectors equal to those defined in Table 1, with the exception that we consider the absolute values of all entries for all N. For this model, all nondissident DMs are systematically positively biased in their predictions with respect to μ_y, with bias values ranging from .5 to 2 standard deviations.
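The difference between the two bias configurations can be made concrete with a little arithmetic on Table 1. The sketch below transcribes the counts from Table 1 and computes the bias of the unweighted crowd mean (the average of the DM biases) under the symmetric configuration and under the asymmetric one, where every entry is replaced by its absolute value. The function names are illustrative, not from the source.

```python
# Bias levels and per-group counts, transcribed from Table 1.
BIAS_LEVELS = [0, 0.5, -0.5, 1, -1, 1.5, -1.5, 2, -2]
COUNTS = {
    5:  [1, 1, 1, 1, 1, 0, 0, 0, 0],
    10: [1, 3, 2, 2, 2, 0, 0, 0, 0],
    15: [1, 3, 3, 3, 3, 1, 1, 0, 0],
    20: [1, 4, 3, 3, 3, 3, 3, 0, 0],
    25: [1, 3, 3, 3, 3, 3, 3, 3, 3],
}

def bias_vector(n, asymmetric=False):
    """Expand the Table 1 counts for group size n into one bias per DM."""
    levels = [abs(b) for b in BIAS_LEVELS] if asymmetric else BIAS_LEVELS
    vec = []
    for level, count in zip(levels, COUNTS[n]):
        vec.extend([level] * count)
    return vec

def crowd_bias(n, asymmetric=False):
    """Bias of the unweighted crowd mean, in standard deviation units."""
    vec = bias_vector(n, asymmetric)
    return sum(vec) / len(vec)

for n in COUNTS:
    print(n, crowd_bias(n), crowd_bias(n, asymmetric=True))
```

Under the symmetric configuration the crowd bias is zero or close to zero for every group size, whereas under the asymmetric configuration it grows with N, which is why averaging no longer cancels the bias penalty in the second case.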

Figure 3 displays these log(R) values as a function of ψ/φ for φ = .95, .70, .40. As expected, the wisdom of the crowd effect is less extreme than in the symmetry condition. Although these log(R) values are smaller than in the symmetry condition, the overall magnitudes of the log(R) values and their relationships to the ratio ψ/φ are similar to those of the completely unbiased condition. This result speaks strongly to the general robustness of the unweighted average. Even in the face of highly and unidirectionally biased members, it is still often preferable to simply average the members as opposed to selecting a single, best-performing unbiased one.

Figure 3. This figure plots log(R) as a function of the ratio ψ/φ for φ = .95, .70, .40. For each value of φ, log(R) values are plotted for five sample sizes, N = 5, 10, 15, 20, 25. The left-hand graph corresponds to φ = .95, the middle graph corresponds to φ = .70, and the right-hand graph corresponds to φ = .40. The line, log(R) = 0, is plotted for reference. All points above this line represent cases where the crowd's performance is superior.

Applications of the Model to Real Data 

In this section, we reanalyze data from two 
papers that investigated the wisdom of crowds 
using two different types of tasks (trivia questions 
and sports betting) with different types of data 
(continuous estimates and dichotomous choices). 
The first analysis applies our framework to the 
data from Vul and Pashler (2008) to estimate the 
expected loss of a crowd versus an individual 
selected at random. This analysis illustrates how 
our theory could be extended to multiple
prediction/estimation tasks. The second analysis applies
our framework to the data from Simmons et al. 
(2011) to estimate the wisdom of a crowd whose 
members repeatedly predicted sports outcomes 
against a point handicap. We use our framework 
to highlight the impact of member bias induced by 
the four different experimental conditions and 
demonstrate how bias impacts the performance of 
a crowd. 

Estimation of Expected Loss 

Recall that Inequality (2), our criterion for a crowd to be wise, provides a breakdown of expected squared loss from a crowd versus a randomly chosen individual, and is as follows.

$$\Big(\sum_{i=1}^{N} w_i \mu_{x_i} - \mu_y\Big)^2 + \sum_{i=1}^{N}\sum_{i'=1}^{N} w_i w_{i'} \sigma_{x_i,x_{i'}} - 2\sum_{i=1}^{N} w_i \sigma_{x_i,y} + \sigma_y^2 < \sum_{i=1}^{N} p_i \Big[(\mu_{x_i} - \mu_y)^2 + \sigma_{x_i}^2 - 2\sigma_{x_i,y}\Big] + \sigma_y^2 \qquad (2)$$

The left-hand side (LHS) of this inequality is a linear combination of (a) the crowd-level bias, (b) the covariances among crowd members, (c) the crowd covariance with the criterion, and (d) the variance of the criterion. The right-hand side (RHS) of Inequality (2) is a linear combination of (a) the bias of each individual's estimates, (b) the variance of each individual's estimates, (c) the covariance of each individual's estimates with the criterion, and (d) the variance of the criterion. For each dataset, we estimate each of the components of LHS and RHS. As in the
previous section, we will evaluate crowd wisdom via log(R), where R = RHS/LHS, which we estimate for each dataset. We compare the estimate of log(R) to a previously established measure of crowd performance: the percentage of individuals that are less accurate than the crowd (Simmons et al., 2011). This percentage is determined by comparing the mean squared error (MSE) of each individual with the MSE of the crowd. It is important to note that by moving from the theory to empirical data, we must estimate the parameters of interest and, subsequently, log(R). Hence, we also need to be concerned with sampling error with respect to parameter estimation. To impose as few assumptions as possible on the empirical data, we carried out a jackknife procedure (Miller, 1974) to estimate this variability. Alternative methods could also be used, for example, Bayesian estimation, with additional distributional assumptions on DM behavior.
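A delete-one jackknife of the kind described here can be sketched in a few lines. The source does not spell out the exact variance formula it used, so the block below assumes the standard delete-one jackknife standard error; the function name and the convention that the last array axis indexes observations are choices made for this illustration.

```python
import numpy as np

def jackknife_se(data, statistic):
    """Delete-one jackknife standard error of statistic(data).

    data      : array whose LAST axis indexes the J observations
    statistic : function mapping such an array to a scalar estimate
    """
    data = np.asarray(data)
    j = data.shape[-1]
    # Recompute the statistic J times, each time leaving out one observation.
    loo = np.array([statistic(np.delete(data, k, axis=-1)) for k in range(j)])
    # Standard jackknife variance: (J - 1) / J times the sum of squared
    # deviations of the leave-one-out estimates from their mean.
    return np.sqrt((j - 1) / j * np.sum((loo - loo.mean()) ** 2))
```

As a sanity check on the formula, applying it to the sample mean reproduces the familiar standard error s / sqrt(J). In the present setting, `statistic` would be a function that returns the estimate of log(R) from the retained observations.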

Reanalysis of Vul and Pashler (2008) 

The theory we have developed in previous sec- 
tions was defined over a single, abstract prediction 
task. To illustrate how our theory could be applied 
to the more general case of multiple tasks, we 
consider experimental data from Vul & Pashler 
(2008). To be clear, we are not proposing a theory 
of crowd wisdom across multiple tasks per se, 
rather we are suggesting one possible method of 
extending our approach. Vul and Pashler ran a 
study with multiple tasks in the form of 8 trivia 
questions. Our analysis of this data set will require 
additional assumptions on the questions. Given 
the similarity of the questions, we considered 
these questions as a sample of questions drawn 
from a universe Y of questions that could have 
been selected. We introduce a new index, j, for 
estimating the parameters from this population of 
questions. Considering the questions as a random 
sample from Y allows this collection of questions 
to be treated as a random variable with mean, μ_y,
and variance, σ²_y.

Hence, we are making inferences at the level 
of the population of possible questions. The 
biases and covariances of the DMs are defined 
at the level of the random variable Y. We do not 
assume that DMs have stationary biases with 
respect to each trivia question; rather, each DM has a bias with respect to the average answer from the universe of possible trivia questions, μ_y. All questions from Vul and Pashler were on a similar scale (responses from 0-100).

Vul and Pashler (2008) used data from N = 428 subjects who provided estimates (from 0 to 100) to J = 8 questions. Each subject provided two responses (immediately; delayed by three weeks); the current analysis uses only the immediate response data. Each of the subjects, i = 1, ..., 428, produced judgments for each of the questions, j = 1, ..., 8, denoted x_ij. The answers to the 8 questions are denoted as y_j. Vul and Pashler tested the wisdom of the crowd by comparing individual versus group mean squared error (MSE) across the 8 questions.

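Inequality (2) requires estimates of the mean and variance of the criterion Y, which are computed as the sample mean and sample variance of the 8 true answers:

$$\hat{\mu}_y = \bar{y} = \frac{1}{8}\sum_{j=1}^{8} y_j = 32.64,$$

and

$$\hat{\sigma}_y^2 = \frac{1}{8-1}\sum_{j=1}^{8} (y_j - \bar{y})^2 = 556.11.$$

Next, we estimate the mean judgment from each individual and the covariance between the judgments produced by all pairs of individuals as

$$\hat{\mu}_{x_i} = \bar{x}_i = \frac{1}{8}\sum_{j=1}^{8} x_{ij} \quad \text{for } i = 1, \ldots, N,$$

and

$$\hat{\sigma}_{x_i,x_{i'}} = \frac{1}{8-1}\sum_{j=1}^{8} (x_{ij} - \bar{x}_i)(x_{i'j} - \bar{x}_{i'}).$$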
Finally, the covariances of the judgments with the criterion variable are computed in the same way for each individual using the 8 judgments and answers:

$$\hat{\sigma}_{x_i,y} = \frac{1}{8-1}\sum_{j=1}^{8} (x_{ij} - \bar{x}_i)(y_j - \bar{y}).$$
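The estimation steps above can be sketched as a single function that takes a matrix of judgments and a vector of true answers and returns plug-in estimates of the two sides of Inequality (2). This is an illustrative sketch, not the authors' code: the function name, the array layout (rows are individuals, columns are questions), and the defaults of equal weights and equal selection probabilities are assumptions made here, and the covariances use the J - 1 denominator to match the sample variance of the criterion.

```python
import numpy as np

def expected_loss_terms(x, y, w=None, p=None):
    """Plug-in estimates of LHS, RHS, and log(R) for Inequality (2).

    x : (N, J) array, judgment of individual i on question j
    y : (J,) array of true answers
    w : crowd weights (default: equal weights)
    p : individual selection probabilities (default: equal probabilities)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    w = np.full(n, 1.0 / n) if w is None else np.asarray(w, dtype=float)
    p = np.full(n, 1.0 / n) if p is None else np.asarray(p, dtype=float)

    mu_x, mu_y = x.mean(axis=1), y.mean()
    # Joint sample covariance of the N individuals and the criterion.
    full = np.cov(np.vstack([x, y]), ddof=1)
    s_xx, s_xy, s_yy = full[:n, :n], full[:n, n], full[n, n]

    # Crowd: squared bias + covariance among members - 2 * covariance with Y.
    lhs = (w @ mu_x - mu_y) ** 2 + w @ s_xx @ w - 2 * (w @ s_xy) + s_yy
    # Individual: probability-weighted bias, variance, and covariance terms.
    rhs = p @ ((mu_x - mu_y) ** 2 + np.diag(s_xx) - 2 * s_xy) + s_yy
    return lhs, rhs, np.log(rhs / lhs)
```

One design note: because every term is a sample moment, the crowd estimate equals the squared mean error of the averaged judgments plus the sample variance of that error, which gives a convenient check on any implementation.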



Results 



We computed the estimates for the LHS and RHS of Inequality (2) and, in Table 2, compared the measure log(R) with the commonly used post hoc method of computing the percent of individuals beat by the crowd. We manipulate the definition of the crowd's judgment by (a) creating several subgroups of the original crowd and (b) applying different weighting criteria for the crowd prediction. The first row is a comparison of the



Table 2
Estimates for Expected Loss of Crowd Versus a Randomly Selected Individual

                                      Expected loss estimate
                                      LHS        RHS                           Proportion of individuals
Weights (w)                           (Crowd)    (Individual)   log(R) (SE)    beat by the crowd
Equal weights
  100% of crowd                        131.23     608.58         1.53 (0.07)    0.96
  Most valid individual                112.18     608.58         1.69 (0.08)    0.99
  Most valid 5%                         60.54     608.58         2.31 (0.04)    1.00
  Most valid 25%                        52.95     608.58         2.44 (0.07)    1.00
  Most valid 50%                        57.36     608.58         2.36 (0.09)    1.00
  Least valid 50%                      299.25     608.58         0.71 (0.06)    0.78
  Least valid 25%                      429.05     608.58         0.35 (0.06)    0.62
  Least valid 5%                       722.55     608.58        -0.17 (0.04)    0.30
  Least valid individual              1251.11     608.58        -0.72 (0.04)    0.08
Unequal weights
  Proportional to validity              87.12     608.58         1.94 (0.07)    1.00
  Inversely proportional to validity   261.96     608.58         0.84 (0.06)    0.83

Note. Estimates of log(R) are accompanied by SEs produced from the jackknife procedure.
equally weighted crowd against individuals drawn randomly with equal probability. The next eight rows in Table 2 show the results of producing subsets of the original crowd by equally weighting the judgments produced by subsets of individuals based on their validity. For example, the most valid individual crowd is created by giving only the most valid crowd member a weight of 1, and putting zero weight on the remainder of the crowd. The most valid 50% crowd was created by weighting equally those ranked in the top 50% based on validity and giving a weight of zero to the bottom 50%. Each crowd in Table 2 is compared with the expected loss of randomly selecting an individual from the entire sample. The bottom two rows apply unequal weights to the entire crowd that are proportional, or inversely proportional, to each individual's validity.

The columns of Table 2 provide estimates of expected loss (in the first two columns), the measure of crowd performance, log(R), in the third column, and the percent of individuals beat by the crowd in the last column. We also include the standard error of the estimate of log(R) computed using the jackknife procedure.² The values of log(R) correspond nicely to the percent of individuals beat by the crowd (Pearson r = .93; Kendall τ = .94). Table 3 presents a breakdown of the expected loss estimate based on (a) crowd bias, (b) crowd covariance, and (c) crowd covariance with criterion. This breakdown shows the source of changes in the estimates of the crowd expected loss. We can see that the merely marginal improvement of the most valid individual is explained by the very high crowd covariance term, despite that crowd having lower bias and higher covariance with the criterion.
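The breakdown can be checked arithmetically against the reported totals. The sketch below combines the three Table 3 components with the criterion variance (556.11) and compares the result with the LHS and log(R) columns of Table 2. It rests on an inference rather than on anything the source states outright: the reported numbers are consistent with the bias column being the squared crowd bias, the covariance-with-criterion column already including the factor of 2 and entering with a minus sign, and log denoting the natural logarithm.

```python
import math

SIGMA2_Y = 556.11   # sample variance of the 8 true answers
RHS = 608.58        # individual expected loss, identical in every Table 2 row

# name: (bias, crowd covariance, covariance with criterion,
#        reported LHS, reported log(R)), transcribed from Tables 2 and 3
ROWS = {
    "100% of crowd":            (19.19, 339.76, 783.84, 131.23, 1.53),
    "Most valid individual":    (0.58, 1041.84, 1486.35, 112.18, 1.69),
    "Most valid 5%":            (0.63, 878.50, 1374.70, 60.54, 2.31),
    "Most valid 25%":           (8.44, 678.01, 1189.61, 52.95, 2.44),
    "Most valid 50%":           (8.70, 562.95, 1070.41, 57.36, 2.36),
    "Least valid 50%":          (33.79, 206.62, 497.28, 299.25, 0.71),
    "Least valid 25%":          (40.08, 144.04, 311.18, 429.05, 0.35),
    "Least valid 5%":           (39.20, 87.67, -39.56, 722.55, -0.17),
    "Least valid individual":   (168.68, 181.70, -344.63, 1251.11, -0.72),
    "Proportional to validity": (14.31, 442.53, 925.84, 87.12, 1.94),
    "Inversely proportional":   (29.54, 209.83, 533.52, 261.96, 0.84),
}

def crowd_loss(bias, cov, cov_with_criterion):
    """Crowd expected loss rebuilt from its Table 3 components."""
    return bias + cov - cov_with_criterion + SIGMA2_Y

for name, (bias, cov, cov_crit, lhs_reported, log_r_reported) in ROWS.items():
    lhs = crowd_loss(bias, cov, cov_crit)
    print(f"{name:26s} LHS={lhs:8.2f} (reported {lhs_reported:8.2f}) "
          f"log(R)={math.log(RHS / lhs):5.2f} (reported {log_r_reported:5.2f})")
```

Every row reproduces the reported LHS to within rounding, which makes the trade-off in the text visible: the most valid individual has almost no bias and the largest covariance with the criterion, but also by far the largest covariance term, so its loss ends up only slightly below that of the full crowd.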

The values from this example in Table 2 
show that the crowd's expected loss only ex- 
ceeds the individual expected loss in very ex- 
treme cases (i.e., the crowd weighting includes 
the 5% least valid individuals). These results 
demonstrate the robustness of the wisdom of 
this particular crowd, even when applying 
weights to the crowd that are inversely related 
to validity. 

Reanalysis of Simmons et al. (2011) 

We now analyze a data set from a study 
which suggested that the crowd is not wise, and 
performs worse than a large majority of individuals.
Simmons et al. (2011) hypothesized
that systematic bias in individuals' judgments



can potentially cause the crowd to be unwise 
even when all conditions that typically foster 
wisdom of the crowds hold. To test this hypoth- 
esis, Simmons et al. designed a series of exper- 
iments which use a point spread betting market; 
previous research suggests that crowds in this 
context may not be wise (Kahneman & Freder- 
ick, 2002; Simmons & Nelson, 2006) because 
of individuals having a tendency to bet on fa- 
vorites over underdogs despite the fact that 
point spreads attempt to produce even odds for 
underdogs and favorites (Levitt, 2004; Sim- 
mons & Nelson, 2006). 

Simmons et al. (2011) ran an experiment 
where they accentuated the effect of this bias to 
choose the favorite by having subjects bet on 
point spreads that were systematically shifted 
(relative to Las Vegas point spreads) to make 
the underdog team have better odds of winning. 
The control condition required subjects to 
choose which team they believed would win 
against the point spread (labeled choice condi- 
tion). The authors attempted to shift the amount 
of bias for choosing the favorite by three exper- 
imental manipulations. The warned choice con- 
dition warned each individual that the point 
spreads have been set incorrectly such that bet- 
ting on the underdog team has the better odds of 
winning. In the estimate condition subjects did 
not bet against the point spread, but instead 
provided an estimate of the final score of the 
game. The estimate is then compared with the 
point spread to infer a choice against the point 
spread. This method is predicted to reduce bias 
by shifting the response mode away from the 
choice between the favorite and underdog 
(which is systematically biased toward choos- 
ing the favorite) to estimating the number of 
points in which they expect the favorite to win. 
Finally, subjects in the choice/estimate condi- 
tion predicted the winner against the point 
spread and then provided an estimate of the final 
score. The choices for this condition are also 
inferred from the estimates. The data consist of
N = 178 individuals (choice: n = 43; warned
choice: n = 39; estimate: n = 45; choice/estimate:
n = 51) betting on 226 games over the



² Given J observations, the jackknife procedure that we employed computes J-many estimates of log(R), each after eliminating the jth observation (j = 1, ..., J). The J estimates are used to compute the standard error. See Miller (1974) for more details.
Table 3
Breakdown of Crowd Expected Loss Estimate

                                                  Crowd         Covariance
Weights (w)                           Bias        covariance    with criterion
Equal weights
  100% of crowd                        19.19        339.76          783.84
  Most valid individual                 0.58      1,041.84        1,486.35
  Most valid 5%                         0.63        878.50        1,374.70
  Most valid 25%                        8.44        678.01        1,189.61
  Most valid 50%                        8.70        562.95        1,070.41
  Least valid 50%                      33.79        206.62          497.28
  Least valid 25%                      40.08        144.04          311.18
  Least valid 5%                       39.20         87.67          -39.56
  Least valid individual              168.68        181.70         -344.63
Unequal weights
  Proportional to validity             14.31        442.53          925.84
  Inversely proportional to validity   29.54        209.83          533.52



course of 17 weeks (number of games per week 
varied). 

Simmons et al. analysis. Details of the 
original analysis conducted by Simmons et al. 
are found in Tables 2-4 of Simmons et al. 
(2011). Their results show that in both choice 
conditions (regardless of warning), the crowd is 
biased and picks the favorite far too often (Sim- 
mons et al. Table 2), and as a result, the crowd 
predictions have fewer wins (Simmons et al. 
Table 3) and outperform a very small percent- 
age of individuals (Simmons et al. Table 4). 
They report that the percent of individuals beat 
by the crowd are 7% for choice, 0% for warned 
choice, 57.8% for estimate, and 35.2% for 
choice/estimate.³ These results suggest that the
crowds in the choice and warned choice condi- 
tions are highly biased and as a result, not wise. 

Expected loss analysis. We apply the ex- 
pected loss metric to test crowd performance in 
each of the conditions outlined above. This 
methodology has the unique advantage of 
showing the contribution of bias to the perfor- 
mance of a crowd, a main objective of Simmons 
et al. (2011). We take a slightly different ap- 
proach to the data analysis that fits the statistical 
assumptions of our framework more closely by 
computing our estimates on the percent of fa- 
vorites chosen in each of the 17 weeks of bet- 
ting. This is a continuous measure with a mean 
and variance that can be estimated across the 17 
weeks of betting. Each individual judgment is the percent of choices for the favorite made in each of the 17 weeks of betting, denoted x_ij, and the criterion is the percent of times the favorite actually wins in each of the 17 weeks, denoted y_j. Some subjects were missing a large number of estimates, so we included only subjects who had missing data for less than 50% of the bets to ensure that all interindividual covariances could be computed. Our sample used N = 164, eliminating only 14 individuals.

The parameters in Inequality (2) are esti- 
mated by modeling the true percent of favorites 
winning against a point spread as a stochastic 
process drawn from a distribution with a fixed 
mean and variance for each week. The estimate 
of the mean and variance of the criterion are 
computed as the sample mean and sample vari- 
ance of the true weekly percent of favorites 
winning. The subjects' mean judgment is com- 
puted as the sample mean of their judgments 
and the individual variance and covariance are 
computed as the sample covariance matrix be- 
tween the 164 individuals. Finally, the validity 
is computed as the covariance between the in- 
dividual judgments of the percent of favorites 
winning each week and the actual percent of 
favorites winning each week. These estimates 
are used to compute the LHS and RHS of In- 
equality (2), the measure of performance, 
log(R), the MSE between the judged and true



³ The percentages differ depending on the method used to compute them; we report the results based on the Simmons et al. counting/median method.
Table 4
Computation of Expected Loss Using Equal Weights and Equal Probabilities for Each Individual

                    Crowd    Individual                   Proportion of individuals
Condition           loss     loss          log(R) (SE)    beat by the crowd
Choice              0.07     0.11          0.40 (0.01)    0.71
Warned choice       0.07     0.10          0.40 (0.01)    0.66
Estimate            0.02     0.08          1.16 (0.02)    1.00
Choice/estimate     0.03     0.08          1.06 (0.01)    0.96

Note. Estimates of log(R) are accompanied by SEs produced from the jackknife procedure.



percent of favorites winning each week and the 
percent of individuals beat by the crowd. 

Results. Table 4 presents the expected loss 
for an equally weighted crowd versus randomly 
choosing an individual from the crowd, in each 
of the four experimental conditions. Our analy- 
sis supports the hypothesis of Simmons et al. by 
demonstrating that the choice and warned 
choice condition crowds performed worse than 
the estimate and choice/estimate condition 
crowds. However, our results also contradict the 
Simmons et al. results in that crowds have lower 
expected loss than a randomly selected individ- 
ual and the crowd outperforms more than 50% 
of individuals in all of the conditions. In other 
words, all four conditions produced wise crowds 
based on expected loss. Finally, Table 5 presents 
the breakdown of the expected loss term for the 
crowd. The inferiority of the choice and warned 
choice conditions is driven by a much larger bias 
term, as predicted by Simmons et al. 

Why do the two methods produce different 
results? There is a large discrepancy between 
the percent of individuals outperformed by the 
crowd calculated by Simmons et al. (2011) (Table 2),
and the percentages obtained by our
analysis (Table 4). This difference can be attributed
to the different metrics used to evaluate
crowd performance. The Simmons et al. method 



Table 5
Breakdown of the Crowd Expected Loss Estimate

                    Crowd    Crowd         Crowd covariance
Condition           bias     covariance    with criterion
Choice              0.047    0.002         -0.005
Warned choice       0.043    0.002         -0.004
Estimate            0.001    0.001         -0.003
Choice/estimate     0.001    0.002         -0.003

for generating crowd prediction is based on 
averaging all individual predictions for each 
individual game. A crowd choice for the favor- 
ite is produced when more than 50% of the indi- 
viduals chose the favorite for that game and a 
crowd choice for the underdog is produced when 
more than 50% of the individuals chose the un- 
derdog for that game. This is a "majority choice 
rule" that is sensitive only to the average being 
above/below a threshold, but insensitive to the 
magnitude of the distances from the threshold. 

Figure 4 plots the proportion of individuals 
beat by the crowd for all possible majority 
choice rules. The wisdom of the crowd changes 
by varying the majority choice rule from 0% to 
100% of choices for the favorite required to 
indicate a crowd choice for the favorite. For 
example, at the 0% choice rule, the crowd 
chooses the favorite for each game and per- 
forms at the base rate level of 43% correct 
choices (i.e., 43% of the favorites win in this 
experiment). At the 100% choice rule, the crowd
chooses the underdog for each game, and performs
at the base rate level of 57% correct. Figure
4 clearly shows that any choice rule above 60%
produces a crowd that beats more than 50% of the 
individual members for all conditions. While Fig- 
ure 4 can reveal that the top two panels exhibit 
more bias by shifting the step function to the right, 
it does not definitively reveal if any of the crowds 
are more or less wise. 
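The family of majority choice rules swept in Figure 4 can be sketched as follows. This is an illustration under stated assumptions, not the Simmons et al. procedure: the function names are invented here, the crowd is taken to pick the favorite only when strictly more than the threshold share of individuals did so, and ties at the threshold therefore go to the underdog, a case the source does not define.

```python
def crowd_choice(share_favorite, threshold=0.5):
    """Crowd pick for one game under a majority choice rule.

    share_favorite : fraction of individuals who chose the favorite
    threshold      : share required before the crowd picks the favorite
    """
    return "favorite" if share_favorite > threshold else "underdog"

def crowd_accuracy(shares, favorite_won, threshold=0.5):
    """Proportion of games the crowd calls correctly under a given rule."""
    picks = [crowd_choice(s, threshold) for s in shares]
    hits = [(pick == "favorite") == won for pick, won in zip(picks, favorite_won)]
    return sum(hits) / len(hits)

# Hypothetical shares for four games in which the favorite won only once.
shares = [0.8, 0.6, 0.9, 0.55]
favorite_won = [True, False, False, False]
for threshold in (0.0, 0.5, 0.7, 1.0):
    print(threshold, crowd_accuracy(shares, favorite_won, threshold))
```

With these made-up inputs, low thresholds make the crowd pick every favorite and high thresholds make it pick every underdog, so its accuracy moves between the two base rates. The rule reacts only to which side of the threshold the share falls on, not to how far from the threshold it lies, which is the insensitivity the text describes.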

Figure 4. The proportion of individuals beat by the crowd for majority choice decision rules ranging from 0% to 100%.

Extensions to Small Groups Versus
the Crowd

Our definition and analysis have, thus far, been restricted to the case of comparing a group prediction to that of a randomly selected individual. The framework itself could be generalized in a few directions with very minor modifications. For example, we could consider the predictive accuracy of an aggregate of a small group of talented decision makers compared with the overall crowd. Certainly, a small group composed of experts has the potential to outperform a larger crowd composed of less talented members. Yet, for the small group of experts, it is reasonable to ask whether the relative boost in accuracy for predicting the expected value of Y would outweigh the potentially greater gains in variance reduction from incorporating a larger number of less talented group members. Also, performance for the small group could be further diminished if the expert predictions are highly correlated (e.g., Broomell & Budescu, 2009).

We could examine such cases by defining a
new set of weights, $w_j^*$, $j \in \{1, 2, \ldots, N\}$, that
correspond to the aggregate weighting of a
small group of experts. For simplicity, we will
consider the case of a small group of $k$-many
experts such that these $k$-many experts are
members of the larger crowd. Let the larger
crowd's prediction random variable, $C$, be defined
by the weighting scheme $w_j$, $j \in \{1, 2, \ldots, N\}$.
The small group weighting
scheme, $w_j^*$, $j \in \{1, 2, \ldots, N\}$, will necessarily
have $N - k$ many zero weights, corresponding to
the $N - k$ many individuals that are not members
of the small group of experts. Let $C^*$ be the
prediction random variable defined by the
weights $w_j^*$, $j \in \{1, 2, \ldots, N\}$. Following a similar
structure to Inequality (1), we could use the
inequality $E[(C^* - Y)^2] < E[(C - Y)^2]$ as our
definition of "small group wisdom." Given a set
of crowd and small group weights, one could
examine whether, and to what extent, the resulting
inequality holds:

where $\mathbf{w}^*$ is the $N \times 1$ vector of weights,
$w_j^*$, $j \in \{1, 2, \ldots, N\}$, defining $C^*$.
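
One way to write this inequality out explicitly is sketched below. It assumes the usual decomposition of expected squared error into squared bias, variance, and covariance terms. The symbols $\boldsymbol{\mu}_X$ (vector of the judges' mean predictions), $\Sigma_{XX}$ (covariance matrix of the judges' predictions), $\boldsymbol{\sigma}_{XY}$ (vector of covariances with $Y$), $\mu_Y$, and $\mathbf{w}$ (the $N \times 1$ vector of crowd weights) are assumed notation for this sketch and may differ from the typeset original.

```latex
% For any weight vector w with C = w'X:
%   E[(C - Y)^2] = (w'\mu_X - \mu_Y)^2 + w'\Sigma_{XX} w - 2 w'\sigma_{XY} + \sigma_Y^2 .
% The \sigma_Y^2 term appears on both sides and cancels.
\begin{equation*}
(\mathbf{w}^{*\prime}\boldsymbol{\mu}_X - \mu_Y)^2
  + \mathbf{w}^{*\prime}\Sigma_{XX}\mathbf{w}^{*}
  - 2\,\mathbf{w}^{*\prime}\boldsymbol{\sigma}_{XY}
<
(\mathbf{w}'\boldsymbol{\mu}_X - \mu_Y)^2
  + \mathbf{w}'\Sigma_{XX}\mathbf{w}
  - 2\,\mathbf{w}'\boldsymbol{\sigma}_{XY}
\end{equation*}
```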

The weights determining the small group,
$w_j^*$, $j \in \{1, 2, \ldots, N\}$, could be determined
either deterministically or as a function of the
correlation with Y (as in the individual selection 
mechanism in previous sections) or through 
some other process. For example, Budescu and 
Chen (2012) consider the case of weighting 
only the upper 50% of judges who make a 
positive contribution to the crowd (roughly the 
top 50%). As in the previous section, the devi- 
ation in accuracy for one set of weights over the 
other could be evaluated by taking the natural 
logarithm of the ratio between the left- and 
right-hand sides of the above inequality. This 
result allows us, a priori, to evaluate the effects 
of varying the intercorrelation among the small 
group DMs, the biases of their predictions, and 
the number of small group members. 
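
The comparison described above can be sketched numerically. The following is a minimal illustration, not the authors' code: the function name, the four-judge configuration, and all numerical values are hypothetical, and the log ratio is taken over the full expected squared errors (including the variance of Y), which is an assumption about how the ratio is formed.

```python
import numpy as np


def expected_squared_error(w, mu_x, sigma_xx, sigma_xy, mu_y=0.0, var_y=1.0):
    """E[(w'X - Y)^2] = squared bias + Var(w'X) - 2 Cov(w'X, Y) + Var(Y)."""
    w = np.asarray(w, dtype=float)
    bias = w @ mu_x - mu_y
    return float(bias**2 + w @ sigma_xx @ w - 2.0 * (w @ sigma_xy) + var_y)


# Hypothetical crowd: N = 4 unbiased judges with unit variance whose
# predictions intercorrelate at .5; Y has mean 0 and variance 1.
N, k = 4, 2
mu_x = np.zeros(N)
sigma_xx = np.full((N, N), 0.5)
np.fill_diagonal(sigma_xx, 1.0)
# Covariances with Y: the first k judges play the role of the experts.
sigma_xy = np.array([0.6, 0.6, 0.3, 0.3])

w_crowd = np.full(N, 1.0 / N)  # simple average of all N judges
w_small = np.zeros(N)          # N - k weights stay at zero
w_small[:k] = 1.0 / k          # simple average of the k experts only

mse_crowd = expected_squared_error(w_crowd, mu_x, sigma_xx, sigma_xy)
mse_small = expected_squared_error(w_small, mu_x, sigma_xx, sigma_xy)

# A negative log ratio means the small group is expected to be more accurate.
log_ratio = float(np.log(mse_small / mse_crowd))
print(mse_crowd, mse_small, log_ratio)
```

In this hypothetical configuration the small group has the lower expected squared error, so the log ratio is negative. Raising the intercorrelation between the two experts, or biasing their means, moves the ratio back toward the crowd.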

Discussion 

We have presented a precise definition of the 
"wisdom of the crowds" effect as well as a math- 
ematical framework in which to evaluate it. We 
define a crowd as wise if a linear combination of 
member predictions is, on average, closer to the 
criterion value than the prediction of a single 
member who is selected according to a prespeci- 
fied probability distribution. Our definition can be 
simply stated as an inequality (Proposition 1). 

Given the popularity and ubiquity of the 
"wisdom of the crowds" (over 9 million hits on 
Google) one may be tempted to downplay the 
importance of this contribution. However, it is 
important to realize that in the vast majority of 
instances the effect is either not precisely defined
or not defined at all. In fact, one could say
that, like obscenity (Jacobellis v. Ohio, 1964), it
is easy to recognize wisdom of crowds when
seeing it, but rather hard to define it. In particular,
the appropriate way to assess the crowd's
performance is not clear because it is not obvious
what the proper comparative benchmark is.
Larrick et al. (2012) took an important first step 
in this direction by comparing the mean of the 
judges with the mean judge. Our article extends 
and generalizes this definition and illustrates its 
application theoretically and empirically. 

Under this definition, we can specify boundary 
conditions on the wisdom of certain methods of 
crowd aggregation. Analyzing special cases of the 
framework, including different rules for combin- 
ing judgments, different rules for selecting indi- 
viduals against which to compare the crowd's 
judgment, cases of biased crowd members, and
correlated crowd member judgments, we confirm
that even a simple crowd average is robustly wise.
Indeed, for large groups, nearly deterministic
selection of a highly skilled individual DM is necessary
before a crowd average is unwise. Of course, as
Result 1 shows, there always exists a wise aggregation
rule for every individual selection rule.

An advantage of our approach is that it can 
predict when a crowd will be wise prior to any 
data collection. In this manner, our framework can 
guide the a priori construction of an optimal (i.e., 
maximally wise) group. Because our framework 
can accommodate a wide set of constraints, for 
example, various bias configurations and essen- 
tially any pattern of interjudge correlations, it can 
be tailored for particular problems and environ- 
ments that do not fit the classic conditions for 
crowd wisdom. Also, our definition of a crowd 
prediction is sufficiently general and flexible to 
accommodate robust aggregation measures based 
on trimmed or Winsorized means (e.g., Jose &
Winkler, 2008), or medians (Hora, Fransen, 
Hawkins, & Susel, 2012). 

Our general results are limited to the use of the 
squared error accuracy metric. Future work could 
consider alternative accuracy metrics, such as av- 
erage absolute accuracy. This would require addi- 
tional assumptions on the prediction distributions, 
but, in principle, our general approach could be 
extended to any well-defined accuracy metric. In 
addition, one could consider alternative general- 
izations, such as comparing the crowd perfor- 
mance with the best performing individual. 

One perhaps surprising conclusion that emerges 
is that, contrary to the extant literature that uses 
the case of uncorrelated judges as the baseline 
(e.g., Clemen & Winkler, 1986; Hogarth, 1978), 
we find that a group is wisest, all things equal, 
when it is maximally "diverse" in that its members 
are as negatively correlated as possible. Though 
we begin with different motivations and use dif- 
ferent mathematics, this result confirms earlier 
literature suggesting that diverse groups perform 
better (see Hong & Page, 2004). Why is diversity 
so important? A helpful analogy is to think of a 
group like a financial portfolio whose members 
are assets. It is useful to hedge one's bets by 
holding some assets that are negatively correlated 
with the rest of the portfolio, so that there are some 
positive returns when other assets perform poorly.
Similarly, we find that wise groups should include
some judges who predict better when others falter.
In large groups, there are considerable mathematical
constraints on how negatively correlated
judges can be with each other. In these cases, the 
rule reduces to maximal performance when all 
judges are uncorrelated, which, under normality 
assumptions, implies statistical independence. 

When applying our theory, it is important to 
distinguish between judges being independent and 
their judgments being uncorrelated. The crowd 
wisdom literature stresses the importance of inde- 
pendence — having the judges generate predic- 
tions without consulting, conferring and commu- 
nicating — but this does not imply that their 
quantitative predictions will be uncorrelated. In- 
deed, practically all the empirical literature shows 
that experts in all domains are highly and posi- 
tively correlated (Ashton, 1986; Clemen & Win- 
kler, 1986; Winkler, 1971; Winkler & Poses, 
1993). Broomell and Budescu (2009) describe the 
sources of these interjudge correlations, such as 
access to common information, intercorrelated 
cues, and similar training of the experts. Broomell 
and Budescu go on to illustrate how unlikely it is 
to find uncorrelated judges. Our framework is well 
suited to modeling such situations as it naturally 
accommodates correlated judges. 

In light of these constraints, the relevance of 
skill-diversity trade-offs becomes apparent. Hav- 
ing skilled members in the group is important, but 
in the presence of some skilled members, it be- 
comes more important to add members with truly 
different perspectives and/or access to other 
sources of information. Diversity of this sort is 
highly valuable to crowd wisdom. Our framework 
offers a systematic method of investigating the 
precise conditions under which a crowd is no 
longer wise. This allows us to answer questions of 
the form: Given a specified level of intercorrela- 
tions among the group members, how much mem- 
ber bias can be tolerated before the group is no 
longer wise (or vice versa)? A numerical example
is helpful to illustrate. Consider a group with five
unbiased members, whose predictions are highly
intercorrelated with one another at .7. Suppose these
five members are all skilled, with a correlation of
.5 with the criterion (assume the criterion random
variable has a variance equal to 1). For this group,
adding another member with identical attributes
will create a six-member group with an expected
squared error value of 22. However, adding a less
skilled member who correlates with the criterion
at .1, but who is also less correlated with the other
group members at .2, will yield a six-member
group with an expected squared error value of
17.8. For this example, adding a much less skilled
member who created more diversity in the crowd
yielded a more accurate crowd than adding another,
much more skilled group member.
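
The arithmetic behind this example can be checked with a short sketch. It rests on two assumptions that the text does not state: every judge has unit variance, and the aggregate is the unweighted sum of the six predictions (a weight of 1 per judge). Under those assumptions the values 22 and 17.8 are reproduced. The helper names below are hypothetical. For comparison, the sketch also evaluates a simple average (weights of 1/6), under which the ordering of the two groups is unchanged.

```python
import numpy as np


def expected_squared_error(w, sigma_xx, sigma_xy, var_y=1.0):
    """E[(w'X - Y)^2] for unbiased judges (all means equal to the mean of Y)."""
    w = np.asarray(w, dtype=float)
    return float(w @ sigma_xx @ w - 2.0 * (w @ sigma_xy) + var_y)


def six_member_group(r_new_y, r_new_others):
    """Five judges (intercorrelation .7, validity .5) plus one new member."""
    sigma_xx = np.full((6, 6), 0.7)
    sigma_xx[5, :] = sigma_xx[:, 5] = r_new_others
    np.fill_diagonal(sigma_xx, 1.0)  # assumed: unit variance for every judge
    sigma_xy = np.array([0.5] * 5 + [r_new_y])
    return sigma_xx, sigma_xy


identical = six_member_group(r_new_y=0.5, r_new_others=0.7)
diverse = six_member_group(r_new_y=0.1, r_new_others=0.2)

unit_weights = np.ones(6)             # assumed: unweighted sum of predictions
mean_weights = np.full(6, 1.0 / 6.0)  # simple average, for comparison

sum_identical = expected_squared_error(unit_weights, *identical)
sum_diverse = expected_squared_error(unit_weights, *diverse)
avg_identical = expected_squared_error(mean_weights, *identical)
avg_diverse = expected_squared_error(mean_weights, *diverse)

print(round(sum_identical, 4), round(sum_diverse, 4))
print(round(avg_identical, 4), round(avg_diverse, 4))
```

With the simple average the two expected squared errors are 0.75 and roughly 0.744, so the less skilled but more diverse sixth member still yields the more accurate group, though by a much smaller margin.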

In our analyses, we found that the direction, 
pattern, and magnitude of individual biases all 
played a role in determining crowd wisdom. However,
the overall effect of crowd wisdom was surprisingly
robust to individual bias. In other
words, unless one could identify nearly deterministically
the best individual, who must be quite skilled
(high correlation with the criterion), one is still
better off using an unweighted aggregate. We
confirmed these results with a reanalysis of an 
empirical study (Simmons et al., 2011) in which 
participants made systematically biased predictions. 

Given our results, we conclude that, in general, 
extraordinary evidence is needed to justify choosing 
an expert's judgment over the aggregate of a crowd.

Appendix
Proof of Result 1. 

The result follows if we show that the inequality from Proposition 1 always holds when $w_i = p_i$,
$\forall i \in \{1, 2, \ldots, N\}$. Recall that we assume $w_i \geq 0$, $\forall i \in \{1, 2, \ldots, N\}$, and that $\sum_{i=1}^{N} w_i = 1$.
More precisely, we demonstrate that the following inequality always holds:

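$$\left(\sum_{i=1}^{N} w_i \mu_{x_i} - \mu_y\right)^2 + \sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j \sigma_{x_i x_j} \leq \sum_{i=1}^{N} w_i (\mu_{x_i} - \mu_y)^2 + \sum_{i=1}^{N} w_i \sigma^2_{x_i}.$$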

Expanding terms and simplifying gives,

$$\left(\sum_{i=1}^{N} w_i \mu_{x_i}\right)^2 + \sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j \sigma_{x_i x_j} \leq \sum_{i=1}^{N} w_i \mu^2_{x_i} + \sum_{i=1}^{N} w_i \sigma^2_{x_i}$$

$$\iff \sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j \sigma_{x_i x_j} - \sum_{i=1}^{N} w_i \sigma^2_{x_i} \leq \sum_{i=1}^{N} w_i \mu^2_{x_i} - \left(\sum_{i=1}^{N} w_i \mu_{x_i}\right)^2.$$

The right-hand side of the above inequality is always non-negative by Jensen's inequality.
Hence, we need only show that the left-hand side of the above inequality is non-positive and the
result will follow. The left-hand side of the above inequality being non-positive is equivalent to the
following.

which holds if, and only if.

To prove that the above inequality always holds, consider the maximal sum on the left-hand side.
Because $\Sigma_{XX}$ is positive semi-definite, $|\sigma_{x_i x_j}|$ is bounded above by $\frac{1}{2}(\sigma^2_{x_i} + \sigma^2_{x_j})$ for all
$i, j \in \{1, 2, \ldots, N\}$, $i \neq j$. Substituting this upper bound for all $\sigma_{x_i x_j}$ terms gives the
following,

which, by symmetry of $\Sigma_{XX}$, equals the following.

Finally, by expanding the terms on the left-hand side we will demonstrate that
$\sum_{i=1}^{N}\sum_{j \neq i} w_i w_j \sigma^2_{x_i} = \sum_{i=1}^{N} w_i (1 - w_i)\sigma^2_{x_i}$, so that the previous inequality is, in fact, an equality.
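
The chain of inequalities that these steps describe can be sketched as follows. This is a reconstruction from the surrounding prose, offered under stated assumptions rather than as the typeset original: $\sigma_{x_i x_j}$ is taken to be the $(i, j)$ entry of $\Sigma_{XX}$ and $\sigma^2_{x_i}$ its $i$th diagonal entry.

```latex
\begin{align*}
% (a) the left-hand side is non-positive
\sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j \sigma_{x_i x_j}
  &\le \sum_{i=1}^{N} w_i \sigma^2_{x_i} \\
% (b) move the diagonal terms w_i^2 \sigma^2_{x_i} to the right-hand side
\iff \sum_{i=1}^{N}\sum_{j \ne i} w_i w_j \sigma_{x_i x_j}
  &\le \sum_{i=1}^{N} w_i (1 - w_i)\, \sigma^2_{x_i} \\
% (c) replace each covariance by its upper bound; valid because w_i w_j \ge 0
\sum_{i=1}^{N}\sum_{j \ne i} w_i w_j \tfrac{1}{2}\bigl(\sigma^2_{x_i} + \sigma^2_{x_j}\bigr)
  &\le \sum_{i=1}^{N} w_i (1 - w_i)\, \sigma^2_{x_i} \\
% (d) relabel i and j in the \sigma^2_{x_j} half of the double sum
\sum_{i=1}^{N}\sum_{j \ne i} w_i w_j \sigma^2_{x_i}
  &\le \sum_{i=1}^{N} w_i (1 - w_i)\, \sigma^2_{x_i}
\end{align*}
```

Because the left-hand side of (b) is no larger than the left-hand side of (c), establishing (d) is sufficient for (a). Since $\sum_{j \neq i} w_j = 1 - w_i$, the two sides of (d) coincide.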

Expanding the left-hand side, we obtain

$$\sum_{i=1}^{N}\sum_{j \neq i} w_i w_j \sigma^2_{x_i} = (w_1 w_2 \sigma^2_{x_1} + w_1 w_3 \sigma^2_{x_1} + \cdots + w_1 w_N \sigma^2_{x_1}) + (w_2 w_1 \sigma^2_{x_2} + w_2 w_3 \sigma^2_{x_2} + \cdots + w_2 w_N \sigma^2_{x_2}) + \cdots = \sum_{i=1}^{N} w_i \sigma^2_{x_i} \sum_{j \neq i} w_j = \sum_{i=1}^{N} w_i (1 - w_i)\sigma^2_{x_i},$$

and the main result follows, thus completing the proof. □